Taking an airplane is one of the most important and efficient ways to travel. However, many travelers have experienced delayed flight. Which airline carriers delayed most often? Which airports have highest probability to make you wait for a long time? Also, which day of week and which month of year are better for your journey without severe delay?
The aim of this presentation is to produce a graphical summary of the airline performance data.
For this project, we will be exploring the airline on-time performance from a publicly available dataset from the stat-computing website (see reference at the end of this document).
The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008.
The dataset is packed in a yearly chunks from 1987 to 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed.
To make sure we will be not overwhelmed by the size of the data, we will create a subset file that will be used for our Exploratory Data Analysis task. We will first explain how we retreive the data, build a database and retreive a random subset for this project.
As explain, the data come as a list of CSV files (compressed as bz2). To decompress the data we used this command :
bzip2 -d *.bz2
Here is the details of the files once decompressed :
air-mic:dataset mic0331$ ls -l
total 23494656
-rw-r-----@ 1 mic0331 staff 127162942 Apr 2 18:19 1987.csv
-rw-r-----@ 1 mic0331 staff 501039472 Apr 2 18:21 1988.csv
-rw-r-----@ 1 mic0331 staff 486518821 Apr 2 18:30 1989.csv
-rw-r-----@ 1 mic0331 staff 509194687 Apr 2 18:22 1990.csv
-rw-r-----@ 1 mic0331 staff 491210093 Apr 2 18:39 1991.csv
-rw-r-----@ 1 mic0331 staff 492313731 Apr 2 18:22 1992.csv
-rw-r-----@ 1 mic0331 staff 490753652 Apr 2 18:41 1993.csv
-rw-r-----@ 1 mic0331 staff 501558665 Apr 2 18:22 1994.csv
-rw-r-----@ 1 mic0331 staff 530751568 Apr 2 18:41 1995.csv
-rw-r-----@ 1 mic0331 staff 533922363 Apr 2 18:42 1996.csv
-rw-r-----@ 1 mic0331 staff 540347861 Apr 2 18:42 1997.csv
-rw-r-----@ 1 mic0331 staff 538432875 Apr 2 18:42 1998.csv
-rw-r-----@ 1 mic0331 staff 552926022 Apr 2 18:44 1999.csv
-rw-r-----@ 1 mic0331 staff 570151613 Apr 2 18:44 2000.csv
-rw-r-----@ 1 mic0331 staff 600411462 Apr 2 18:46 2001.csv
-rw-r-----@ 1 mic0331 staff 530507013 Apr 2 18:46 2002.csv
-rw-r-----@ 1 mic0331 staff 626745242 Apr 2 18:46 2003.csv
-rw-r-----@ 1 mic0331 staff 669879113 Apr 2 18:46 2004.csv
-rw-r-----@ 1 mic0331 staff 671027265 Apr 2 18:47 2005.csv
-rw-r-----@ 1 mic0331 staff 672068096 Apr 2 18:51 2006.csv
-rw-r-----@ 1 mic0331 staff 702878193 Apr 2 18:51 2007.csv
-rw-r-----@ 1 mic0331 staff 689413344 Apr 2 18:50 2008.csv
In order to considerably speed up access to the data we will load it into a database. We will use sqlite, an open-source portable database.
sqlite3 ontime.sqlite3
Next, we create a table with fields that match the csv files.
sqlite> create table ontime (
Year int,
Month int,
DayofMonth int,
DayOfWeek int,
DepTime int,
CRSDepTime int,
ArrTime int,
CRSArrTime int,
UniqueCarrier varchar(5),
FlightNum int,
TailNum varchar(8),
ActualElapsedTime int,
CRSElapsedTime int,
AirTime int,
ArrDelay int,
DepDelay int,
Origin varchar(3),
Dest varchar(3),
Distance int,
TaxiIn int,
TaxiOut int,
Cancelled int,
CancellationCode varchar(1),
Diverted varchar(1),
CarrierDelay int,
WeatherDelay int,
NASDelay int,
SecurityDelay int,
LateAircraftDelay int
);
Then load the data with the .import directive.
sqlite> .separator ,
sqlite> .import ../dataset/1987.csv ontime
sqlite> .import ../dataset/1988.csv ontime
sqlite> .import ../dataset/1989.csv ontime
sqlite> .import ../dataset/1990.csv ontime
sqlite> .import ../dataset/1991.csv ontime
sqlite> .import ../dataset/1992.csv ontime
sqlite> .import ../dataset/1993.csv ontime
sqlite> .import ../dataset/1994.csv ontime
sqlite> .import ../dataset/1995.csv ontime
sqlite> .import ../dataset/1996.csv ontime
sqlite> .import ../dataset/1997.csv ontime
sqlite> .import ../dataset/1998.csv ontime
sqlite> .import ../dataset/1999.csv ontime
sqlite> .import ../dataset/2000.csv ontime
sqlite> .import ../dataset/2001.csv ontime
sqlite> .import ../dataset/2002.csv ontime
sqlite> .import ../dataset/2003.csv ontime
sqlite> .import ../dataset/2004.csv ontime
sqlite> .import ../dataset/2005.csv ontime
sqlite> .import ../dataset/2006.csv ontime
sqlite> .import ../dataset/2007.csv ontime
sqlite> .import ../dataset/2008.csv ontime
Unfortunately this also imports the column headers, so we remove them with this sql command.
delete from ontime where typeof(year) == "text";
To speed up access to the data, we’ll also want to add indices. The code below adds a few, but we’ll want to think about the needs and add more if needed. (It’s most efficient to create the indices after all the data has been loaded)
sqlite> create index year on ontime(year);
sqlite> create index date on ontime(year, month, dayofmonth);
sqlite> create index origin on ontime(origin);
sqlite> create index dest on ontime(dest);
# not provided in github but the file can be re-built from the previous scripts
ritadb <- dbConnect(RSQLite::SQLite(), './ontime.sqlite3')
from_db <- function(sql) {
dbGetQuery(ritadb, sql)
}
from_db("select count(*) from ontime")
# this query return
# count(*)
# 1 123534969
So we have 123.534.969 records in the database and it has a size of 17.24 GB.
air-mic:data mic0331$ ls -l
total 33671104
-rw-r--r-- 1 mic0331 staff 17239605248 Apr 3 09:51 ontime.sqlite3
The previously run query used to retreive the number of records of the database is already taking a lot of time to be run in R therefore we will generate a subset of random rows taken from the created SQL. This file will contain 800.000 records from the main database.
sqlite> .headers on
sqlite> .mode csv
sqlite> .output rita.csv
sqlite> select * from ontime order by random() limit 800000;
sqlite> .output stdout
The generated file has now a reasonable size of 78.4 Mb
air-mic:data mic0331$ ls -l rita.csv
-rw-r--r-- 1 mic0331 staff 78448194 Apr 3 11:10 rita.csv
Let’s now load the dataset onto a dataframe.
rita = read.csv('./rita.csv')
dim(rita)
## [1] 800000 29
# also lead other utilitiy files
airports = read.csv('./airports.csv')
carriers = read.csv('./carriers.csv')
We will now add several potentially usefull columns and make some cleanup on the data.
# We adjust the date features formats
rita$Month <- sprintf("%02d", rita$Month)
rita$DepTime <- sprintf("%04d", rita$DepTime)
rita$ArrTime <- sprintf("%04d", rita$ArrTime)
rita$DayOfWeek <- sprintf("%02d", rita$DayOfWeek)
rita$DayofMonth <- sprintf("%02d", rita$DayofMonth)
# Differences between the Departure delay and Arrival Delay
rita <- mutate(rita, delta = DepDelay - ArrDelay)
# speed in miles / hour
rita <- mutate(rita, speed = Distance / (AirTime / 60))
# full date of the departure
rita <- mutate(rita, date = as.Date(paste(rita$Year, rita$Month, rita$DayofMonth, sep=""), "%Y%m%d"))
There are 800,000 flights from 1997 till 2008 with 32 features.
str(rita)
## 'data.frame': 800000 obs. of 32 variables:
## $ Year : int 1997 1994 1990 1993 1996 2006 1993 2008 2003 2007 ...
## $ Month : chr "08" "06" "11" "03" ...
## $ DayofMonth : chr "18" "16" "03" "29" ...
## $ DayOfWeek : chr "01" "04" "06" "01" ...
## $ DepTime : chr "1038" "2049" "1855" "0810" ...
## $ CRSDepTime : int 940 1900 1855 809 1245 1545 730 845 1004 1740 ...
## $ ArrTime : chr "1236" "2243" "2012" "0939" ...
## $ CRSArrTime : int 1130 2035 2020 943 1425 1632 950 1034 1624 2110 ...
## $ UniqueCarrier : Factor w/ 29 levels "9E","AA","AQ",..: 27 6 26 2 8 28 4 4 14 2 ...
## $ FlightNum : int 1094 489 359 1412 827 2599 504 401 81 177 ...
## $ TailNum : Factor w/ 12932 levels "","-N037M","-N047M",..: 7632 NA NA NA 11664 1327 NA 7698 8089 4594 ...
## $ ActualElapsedTime: int 118 114 77 89 116 53 144 114 271 388 ...
## $ CRSElapsedTime : int 110 95 85 94 100 47 140 109 260 390 ...
## $ AirTime : int 108 NA NA NA 91 26 NA 88 240 345 ...
## $ ArrDelay : int 66 128 -8 -4 16 -2 4 24 19 22 ...
## $ DepDelay : int 58 109 0 1 0 -8 0 19 8 24 ...
## $ Origin : Factor w/ 333 levels "ABE","ABI","ABQ",..: 223 92 294 86 244 148 236 287 240 165 ...
## $ Dest : Factor w/ 335 levels "ABE","ABI","ABQ",..: 284 104 285 32 21 176 174 238 51 285 ...
## $ Distance : int 671 487 372 597 526 127 834 569 1999 2586 ...
## $ TaxiIn : int 4 NA NA NA 14 5 NA 17 7 5 ...
## $ TaxiOut : int 6 NA NA NA 11 22 NA 9 24 38 ...
## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CancellationCode : Factor w/ 5 levels "","A","B","C",..: NA NA NA NA NA 1 NA 1 NA 1 ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CarrierDelay : int NA NA NA NA NA 0 NA 7 NA 22 ...
## $ WeatherDelay : int NA NA NA NA NA 0 NA 0 NA 0 ...
## $ NASDelay : int NA NA NA NA NA 0 NA 5 NA 0 ...
## $ SecurityDelay : int NA NA NA NA NA 0 NA 0 NA 0 ...
## $ LateAircraftDelay: int NA NA NA NA NA 0 NA 12 NA 0 ...
## $ delta : int -8 -19 8 5 -16 -6 -4 -5 -11 2 ...
## $ speed : num 373 NA NA NA 347 ...
## $ date : Date, format: "1997-08-18" "1994-06-16" ...
The data comes originally from RITA where it is described in detail.
Here is a summary of these features:
columns = read.csv('./columns.csv', sep=";")
columns
## Description Name
## 1 1987-2008 Year
## 2 1-12 Month
## 3 1-31 DayofMonth
## 4 1 (Monday) - 7 (Sunday) DayOfWeek
## 5 actual departure time (local, hhmm) DepTime
## 6 scheduled departure time (local, hhmm) CRSDepTime
## 7 actual arrival time (local, hhmm) ArrTime
## 8 scheduled arrival time (local, hhmm) CRSArrTime
## 9 unique carrier code UniqueCarrier
## 10 flight number FlightNum
## 11 plane tail number TailNum
## 12 in minutes ActualElapsedTime
## 13 in minutes CRSElapsedTime
## 14 in minutes AirTime
## 15 arrival delay, in minutes ArrDelay
## 16 departure delay, in minutes DepDelay
## 17 origin (IATA airport code) Origin
## 18 destination (IATA airport code) Dest
## 19 in miles Distance
## 20 taxi in time, in minutes TaxiIn
## 21 taxi out time in minutes TaxiOut
## 22 was the flight cancelled? Cancelled
## 23 reason for cancellation CancellationCode
## 24 1 = yes, 0 = no Diverted
## 25 in minutes CarrierDelay
## 26 in minutes WeatherDelay
## 27 in minutes NASDelay
## 28 in minutes SecurityDelay
## 29 in minutes LateAircraftDelay
Note:
Reason for cancellation can take 4 possible value (A = carrier, B = weather, C = NAS, D = security).
We have 29 carriers in the dataset.
We have 335 destination airports and 333 origin airports.
As explained previously, some features has been built to facilitate the exploration (e.g. speed, time).
Other observations :
Most flights has no delay.
The volume of flight is growing year after year.
We see an increase of delays year after year.
The dataset does not contain cancelled or diverted flights.
The airports with the highest number of plane are those having the highest number of flights.
Let’s first check the summary of some of the variables.
1/ “Distance”" in miles
summary(rita$Distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 11.0 306.0 544.0 701.4 936.0 4983.0 1263
2/ “ArrDelay”" Delay, in minutes
summary(rita$ArrDelay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1162.000 -7.000 0.000 7.064 11.000 1541.000 16745
3/ “Airtime” in minutes
summary(rita$ArrDelay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1162.000 -7.000 0.000 7.064 11.000 1541.000 16745
4/ “CRSDepTime” Scheduled departure time
summary(rita$ArrDelay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1162.000 -7.000 0.000 7.064 11.000 1541.000 16745
5/ “CRSArrTime” Actual arrival time
summary(rita$ArrDelay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1162.000 -7.000 0.000 7.064 11.000 1541.000 16745
According to our observations, it is effective to devide the numeric variable into the ordered, reasonable range for our analysis. So we split the Distance, ArrDelay, Airtime, CRSDepTime, CRSArrTime respectively and make sure that all the observation is included in given range.
We copy the original data because we won’t use the next transformation later into our analysis.
data = rita
data$Distance = ordered(cut(data$Distance, c(0, 300, 600, 1000, Inf)), labels = c("Short", "Medium", "Long", "Too long"))
ggplot(data, aes(x=Distance)) +
geom_histogram(fill="lightblue") +
theme_bw()
Most flights are coverining medium distance. The number of long and short distance are comparable.
data$ArrDelay = ordered(cut(data$ArrDelay, c(0, 25, 50, 80, Inf)), labels = c("On-Time", "Delayed", "Intermediate-Delayed", "Much-Delayed"))
ggplot(data, aes(x=ArrDelay)) +
geom_histogram(fill="lightblue") +
theme_bw()
Out of the NA value, most of the flights land on-time.
data$AirTime = ordered(cut(data$AirTime, c(-1, 50, 100, 200, 300, Inf)), labels = c("Too-Short", "Short", "Intermediate", "Long", "Too-Long"))
ggplot(data, aes(x=AirTime)) +
geom_histogram(fill="lightblue") +
theme_bw()
Most flight are short in time (less than an hour) or intermediate (from one to 2 hours); long flight (more than 3 hours) are less frequent.
data$CRSDepTime = ordered(cut(data$CRSDepTime, c(-1, 600, 1200, 1800, Inf)), labels = c("Overnight", "Morning", "Afternoon", "Evening"))
ggplot(data, aes(x=CRSDepTime)) +
geom_histogram(fill="lightblue") +
theme_bw()
Most of the flights are scheduled during the morning or afternoon and less frequently the evening.
data$CRSArrTime = ordered(cut(data$CRSArrTime, c(-1, 600, 1200, 1800, 2359)), labels = c("Overnight", "Morning", "Afternoon", "Evening"))
ggplot(data, aes(x=CRSArrTime)) +
geom_histogram(fill="lightblue") +
theme_bw()
Most flights land the afternoon and the evening. This make sence based on the previous histogram.
str(carriers)
## 'data.frame': 1491 obs. of 2 variables:
## $ Code : Factor w/ 1490 levels "02Q","04Q","05Q",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Description: Factor w/ 1491 levels "40-Mile Air",..: 1328 1331 479 887 620 1296 523 12 876 751 ...
data[,"UniqueCarrier"]<-carriers[data[ ,"UniqueCarrier"], "Description"]
qplot(x = UniqueCarrier, data = data, fill = UniqueCarrier) +
theme_bw() +
coord_flip() +
xlab(NULL) +
guides(fill=guide_legend(ncol=2)) +
theme(legend.position="bottom", legend.direction = "horizontal")
Valley Air Express, Tradewind aviation, ACM Air Charter Gmbh and Canada 3000 airlines Ltd. runs most of the airplane in the U.S. in this dataset.
Let’s show the number of flight (flight volume) we have per year.
ggplot(rita, aes(x = Year)) +
geom_histogram(fill="lightblue", binwidth=1, col="black") +
labs(title = "Distribution of flight per year") +
xlab('Year') +
theme_bw()
We see from this plot a continuity in the growth of number of flight. There are less flight in the 80th than in the 2008. This is algin with a logical thinking and also the size of the files used to load the database. The file for 1987 is smaller than other, this is also proven in the histogram. With that we can assume the random selection of the sample has been made correctly and we are not privileging a year over another.
The next line graph show the same thing in a different way.
rita.rita_per_year <- rita %>%
group_by(Year) %>%
summarise(n = n()) %>%
arrange(Year)
head(rita.rita_per_year)
## Source: local data frame [6 x 2]
##
## Year n
## 1 1987 8428
## 2 1988 33711
## 3 1989 32479
## 4 1990 34157
## 5 1991 32926
## 6 1992 33280
ggplot(rita.rita_per_year) +
geom_point(aes(x=Year, y=n)) +
geom_line(aes(x=Year, y=n), col = "lightblue") +
scale_x_continuous(limits = c(1987, 2008), breaks = seq(1987, 2008, 5)) +
ylab("Count") +
theme_bw() +
theme(
panel.grid.major.y = element_line(colour = "black", linetype = 3, size = .5),
panel.background = element_blank(),
axis.title.x = element_text(size=16),
axis.text.x = element_text(size=14, angle=45, hjust=1, vjust=1),
axis.title.y = element_text(size=16, angle = 90),
axis.text.y = element_text(size=14),
strip.background = element_rect(color="white", fill="white"),
strip.text = element_text(size=16)
)
We see a drop in the number of fligh between 2001 and 2003, probably due to the event of September 11, 2001, however, flight volume is on the increase - dramatically so since 2000.
Let’s now focus the analysis on the delays. The next four plots are showing on the left the number (up) and the density (down) of departure delay and on the right the same for arrival delay.
str(rita$DepDelay)
## int [1:800000] 58 109 0 1 0 -8 0 19 8 24 ...
str(rita$ArrDelay)
## int [1:800000] 66 128 -8 -4 16 -2 4 24 19 22 ...
p1 <- ggplot(rita, aes(x=DepDelay / 60)) +
geom_histogram(fill="lightblue", binwidth=.1) +
scale_x_continuous(limits = c(-2, 2), breaks = seq(-2, 2, 1)) +
geom_vline(aes(xintercept=mean(DepDelay / 60, na.rm=T)), # Ignore NA values for mean
color="red", linetype="dashed", size=.5) +
geom_vline(aes(xintercept=median(DepDelay / 60, na.rm=T)), # Ignore NA values for median
color="green", linetype="dashed", size=.5) +
xlab('Departure Delays (hours)') +
theme_bw()
p2 <- ggplot(rita, aes(x=ArrDelay / 60)) +
geom_histogram(fill="lightblue", binwidth=.1) +
scale_x_continuous(limits = c(-2, 2), breaks = seq(-2, 2, 1)) +
geom_vline(aes(xintercept=mean(ArrDelay / 60, na.rm=T)), # Ignore NA values for mean
color="red", linetype="dashed", size=.5) +
geom_vline(aes(xintercept=median(ArrDelay / 60, na.rm=T)), # Ignore NA values for median
color="green", linetype="dashed", size=.5) +
xlab('Arrival Delays (hours)') +
theme_bw()
p3 <- ggplot(rita, aes(x=DepDelay / 60)) +
geom_histogram(fill="lightblue", binwidth=.1, aes(y = ..density..), color = "black") +
geom_density(color="black") +
scale_x_continuous(limits = c(-2, 2), breaks = seq(-2, 2, 1)) +
geom_vline(aes(xintercept=mean(DepDelay / 60, na.rm=T)), # Ignore NA values for mean
color="red", linetype="dashed", size=.5) +
geom_vline(aes(xintercept=median(DepDelay / 60, na.rm=T)), # Ignore NA values for median
color="green", linetype="dashed", size=.5) +
xlab('Departure Delays (hours)') +
theme_bw()
p4 <- ggplot(rita, aes(x=ArrDelay / 60)) +
geom_histogram(fill="lightblue", binwidth=.1, aes(y = ..density..), color = "black") +
geom_density(color="black") +
scale_x_continuous(limits = c(-2, 2), breaks = seq(-2, 2, 1)) +
geom_vline(aes(xintercept=mean(ArrDelay / 60, na.rm=T)), # Ignore NA values for mean
color="red", linetype="dashed", size=.5) +
geom_vline(aes(xintercept=median(ArrDelay / 60, na.rm=T)), # Ignore NA values for median
color="green", linetype="dashed", size=.5) +
xlab('Arrival Delays (hours)') +
theme_bw()
grid.arrange(p1, p2, p3, p4, ncol = 2)
We see that most flight have no delays. The distribution for the departure seems a bit negatively skewed compare to arrival where the distribution is normal.
rita.no_missing <- filter(rita, !is.na(DepDelay) & !is.na(ArrDelay))
rita.by_year <- group_by(rita.no_missing, Year)
# the *1.0 is used to avoid loss of precision when attemting to convert a numeric to an integer
rita.DepDelays <- summarise(rita.by_year,
mean = mean(DepDelay, na.rm = TRUE),
median = median(DepDelay * 1.0, na.rm = TRUE),
q75 = quantile(DepDelay, .75, na.rm = TRUE),
q25 = quantile(DepDelay, .25, na.rm = TRUE),
over_15 = mean(DepDelay > 15, na.rm = TRUE),
over_30 = mean(DepDelay > 30, na.rm = TRUE),
over_60 = mean(DepDelay > 60, na.RM = TRUE))
rita.ArrDelays <- summarise(rita.by_year,
mean = mean(ArrDelay, na.rm = TRUE),
median = median(ArrDelay * 1.0, na.rm = TRUE),
q75 = quantile(ArrDelay, .75, na.rm = TRUE),
q25 = quantile(ArrDelay, .25, na.rm = TRUE),
over_15 = mean(ArrDelay > 15, na.rm = TRUE),
over_30 = mean(ArrDelay > 30, na.rm = TRUE),
over_60 = mean(ArrDelay > 60, na.RM = TRUE))
p1 <- ggplot(rita.DepDelays, aes(x = factor(Year), y = mean)) +
geom_bar(stat = "identity", fill="lightblue") +
geom_line(aes(y=q75, group=1, color="red")) +
geom_point(aes(y=q75)) +
geom_line(aes(y=q25, group=1, color="blue")) +
geom_point(aes(y=q25)) +
geom_line(aes(y=median, group=1, color="yellow")) +
geom_point(aes(y=median)) +
scale_color_manual("Quartile",labels = c("Q25 % (lower quartile)", "Q75 % (upper quantile)", "median"), values = c("blue", "red", "yellow")) +
xlab('Year') +
ggtitle('Departure') +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = .5, vjust=0.5)) +
theme(legend.position="bottom", legend.direction = "horizontal")
p2 <- ggplot(rita.ArrDelays, aes(x = factor(Year), y = mean)) +
geom_bar(stat = "identity", fill="lightblue") +
geom_line(aes(y=q75, group=1, color="red")) +
geom_point(aes(y=q75)) +
geom_line(aes(y=q25, group=1, color="blue")) +
geom_point(aes(y=q25)) +
geom_line(aes(y=median, group=1, color="yellow")) +
geom_point(aes(y=median)) +
scale_color_manual("Quartile",labels = c("Q25 % (lower quartile)", "Q75 % (upper quantile)", "median"), values = c("blue", "red", "yellow")) +
xlab('Year') +
ggtitle('Arrival') +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = .5, vjust=0.5)) +
theme(legend.position="bottom", legend.direction = "horizontal")
grid.arrange(p1, p2, ncol = 1)
The first quartile is at the 25th percentile, we see a clear drop in this quartile and an increase of the upper quantile (75%). This means that we have less and less of the data smaller than the first quartile so a substential increase of the delay year after year. This is true for both departure and arrival but certainly more pronounced for departure.
rita$Diverted <- factor(rita$Diverted,
levels = c(0,1),
labels = c("No", "Yes"))
barplot(prop.table(table(rita$Diverted)))
In this dataset, we have almost all flight not diverted
rita$Cancelled <- factor(rita$Cancelled,
levels = c(0,1),
labels = c("No", "Yes"))
barplot(prop.table(table(rita$Cancelled)))
Also, not many flight cancelled.
We have access to the cancellation code, let’s see the proportion of cancellation code.
rita.rita_cancellation_code <- filter(rita, !is.na(CancellationCode) & Cancelled == "Yes")
rita.rita_cancellation_code$CancellationCode <- factor(rita.rita_cancellation_code$CancellationCode,
levels = c("A","B","C","D"),
labels = c("Carrier", "Weather", "NAS", "Security"))
barplot(prop.table(table(rita.rita_cancellation_code$CancellationCode)))
Let’s look at the number of flights per destination.
ggplot(rita, aes(x = Dest)) +
geom_bar() +
labs(x = "Destination",
y = "Number of Flights") +
theme_bw()
There are way too many destinations. In this case, the best thing to do is aggregate the results outside of ggplot and subset to the top 50 destinations.
rita.flights <- rita %>%
group_by(Dest) %>%
summarise(flight_count = n(), plane_count = n_distinct(TailNum)) %>%
arrange(desc(flight_count))
rita.flight_top_50 <- head(rita.flights, n=50)
Try plotting just the top ten results.
ggplot(rita.flight_top_50[1:10, ], aes(x = Dest, weight = flight_count)) +
geom_bar() +
labs(x = "Destination",
y = "Number of Flights") +
theme_bw()
Since we’ve already got the top destinations in a table format with the number of plane, it’s a good opportunity to display a visualization of these two variables.
ggplot(rita.flight_top_50[1:20, ], aes(x = Dest, y = flight_count)) +
geom_point(fill="lightblue", aes(size=plane_count)) +
labs(title = "Number of flights for the top 20 destination airports") +
xlab('Number of flights') +
theme_bw() +
theme(legend.position="bottom")
No susprise, the airports with the highest number of plane are those having the highest number of flights.
ggplot(rita, aes(x = Distance)) +
geom_histogram() +
theme_bw()
The histogram of distance is left skewed so I’m going to transform the data using a log transform.
p1 <- qplot(data = rita, x = Distance, binwidth = 100, fill = I("#099DD9")) +
ggtitle('Distance')
p2 <- qplot(data = rita, x = Distance, binwidth = .01, fill = I("#F79420")) +
ggtitle('Distance (log10)') +
scale_x_log10()
grid.arrange(p1, p2, ncol = 2)
I log-transformed the left skewed distance distribution. However this transformation does not provide any added value or sence to the data.
Due to my computer limits, for the next operation I only selected 100 flights.
# sample 100 flights from the data set
set.seed(20150422)
rita.rita_samp <- rita[sample(1:length(rita$speed), 100), ]
ggpairs(rita.rita_samp[, c('DayOfWeek', 'Year', 'AirTime', 'ArrDelay', 'DepDelay', 'delta', 'speed', 'Distance', 'CarrierDelay', 'NASDelay', 'LateAircraftDelay')], lower=list(continuous="smooth", params=c(colour="blue")),
diag=list(continuous="bar", params=c(colour="blue")),
upper=list(params=list(corSize=6)), axisLabels='show') +
theme(legend.position = "none",
panel.grid.major = element_blank(),
axis.ticks = element_blank(),
panel.border = element_rect(linetype = "dashed", colour = "black", fill = NA))
str(rita.rita_samp)
From this plot we say that that we don’t have a lot of obvious correlation between features however, let’s look at some of them
For large scale commercial airplane, speed in flight is 500 MPH. Let’s use 600 MPH as a limit.
rita.rita_speed_distance = subset(rita, rita$speed >= 0 & rita$speed <= 600)
p <- ggplot( data = rita.rita_speed_distance, aes(x=speed, y=Distance)) +
geom_point() +
xlim(0, 600) +
theme_bw()
print(p)
As speed increase, the distance increase. This makes a lot of sence. The relationship between speed and distance appears to be exponential rather than linear.
m1 <- lm(Distance ~ speed, data = rita.rita_speed_distance, na.action=na.omit)
p + geom_abline(m1$intercept, m1$slope, colour = "red")
summary(m1)
##
## Call:
## lm(formula = Distance ~ speed, data = rita.rita_speed_distance,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1551.6 -269.2 -89.8 156.4 3953.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.222e+03 2.945e+00 -414.8 <2e-16 ***
## speed 4.922e+00 7.303e-03 674.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 413.8 on 544417 degrees of freedom
## Multiple R-squared: 0.4549, Adjusted R-squared: 0.4549
## F-statistic: 4.543e+05 on 1 and 544417 DF, p-value: < 2.2e-16
print(m1)
##
## Call:
## lm(formula = Distance ~ speed, data = rita.rita_speed_distance,
## na.action = na.omit)
##
## Coefficients:
## (Intercept) speed
## -1221.692 4.922
Based on the R^2 value, speed explains about 45 percent of the variance in distance. The fitted regression line is : Distance = -1221.7 + 4.922 speed. According to this model, average distance increases by 4.922 miles for each additional mile per hour of speed.
The residuals are the vertical distances from the observed distance to the line.
plot(m1, which=1, add.smooth=FALSE)
The residual plot has some unisually large residuals labeled; observations 222924 for example. We also observe that residuals are closer to zero at low distances; the variance of the residuals is not constant across all speed, but increasing with speed.
To predict flight distance for a given speed, the predict method can be used. For example, the predicted distance for a speed of 300 mph is obtained by
results = data.frame(speed=300)
results$m1 <- predict(m1, results)
The predicted distance of flight is 255 miles.
For inference, we requires some asumptions about the distribution of the error term. We assume that the random errors are independant and indentically distributed as Normal random variables.
Residual plots help us to assess the fit of the model and the assumptions for the error.
Here we are requested two plots: a plot of residuals vs fit and a QQ plot to check for normality of residuals.
plot(m1, which=1:2)
In the residual plot a curve has been added. This curve is a fitted lowess (local polynomial regression) curve, called a smoother. The residuals are assummed independant and identivally distributed, but there is a pattern evident. The residuals have a “U” shape or bowl shape. This pattern could indicate that there is a variable missing from the model. In the QQ plot, normally distributed residuals should lie approximately along the reference line shown in the plot. We clearly see that this is not really the case. Therefore, our model regression model is not really a good fit for predicting distance of flight for a given speed.
Let’s try an exponential regression.
results <- data.frame(speed = seq(300, 600, by = 50))
m2 <- lm(Distance ~ speed + I(speed^2), data = rita.rita_speed_distance, na.action=na.omit)
m3 <- lm(Distance ~ speed + I(speed^2) + I(speed^3), data = rita.rita_speed_distance, na.action=na.omit)
m4 <- lm(Distance ~ I(speed^2), data = rita.rita_speed_distance, na.action=na.omit)
results$m2 <- predict(m2, results)
results$m3 <- predict(m3, results)
results$m4 <- predict(m4, results)
results <- melt(results, id.vars = "speed", variable.name = "model",
value.name = "fitted")
ggplot(results, aes(x = speed, y = fitted)) +
xlim(0, 600) +
geom_point(data = rita.rita_speed_distance, aes(x = speed, y = Distance)) +
geom_line(aes(colour = model), size = 1) +
theme_bw()
With the model m3 we have a R^2 of 52.7% which is much better. Also the residual plot has a line shape but still the QQplot normally distributed residuals does not lie along the reference line shown in the plot.
summary(m3)
##
## Call:
## lm(formula = Distance ~ speed + I(speed^2) + I(speed^3), data = rita.rita_speed_distance,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2392.2 -208.6 -59.6 116.0 3959.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.653e+02 1.915e+01 39.95 <2e-16 ***
## speed -5.133e+00 1.698e-01 -30.23 <2e-16 ***
## I(speed^2) 9.118e-03 4.864e-04 18.75 <2e-16 ***
## I(speed^3) 7.426e-06 4.494e-07 16.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 385.4 on 544415 degrees of freedom
## Multiple R-squared: 0.5272, Adjusted R-squared: 0.5272
## F-statistic: 2.023e+05 on 3 and 544415 DF, p-value: < 2.2e-16
plot(m3, which=1:2)
Finally, let’s look at the speed vs distance with less overplotting.
ggplot( data = rita.rita_speed_distance, aes(x=speed, y=Distance)) +
geom_point(alpha = 1/10, position = position_jitter(h = 0)) +
xlim(0, 600) +
theme_bw()
From this last plot we start to see something interesting, we see several points which are disconnected from the main dots. This probably show different kind of plane. We start to see 3 exponentionals curves.
rita$DayOfWeek <- factor(rita$DayOfWeek,
levels = c('01','02','03','04','05','06','07'),
labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saterday", "Sunday"))
ggplot(rita, aes(x=DayOfWeek)) +
geom_histogram() +
theme_bw()
We see Saterday and Sunday are two days with less delay, let’s look more closely
rita.year_day_of_week <- rita %>%
group_by(Year, DayOfWeek) %>%
summarise(ArrDelay_mean = mean(ArrDelay * 1.0, na.rm=T),
ArrDelay_median = median(ArrDelay * 1.0, na.rm=T),
n = n()) %>%
arrange(Year, DayOfWeek)
ggplot(rita.year_day_of_week, aes(x=Year, y=ArrDelay_mean)) +
geom_line(aes(colour=DayOfWeek)) +
ylab('Arrival Delay (min)') +
theme_bw()
Best days to travel and avoid delays are Saturdays, Sunday but also Tuesdays or Wednesdays. Fridays are bad for delays. Also note the pick than the drop arround 2001, this is probably representative to the 9/11 event.
Let’s look more closely to the 8 last years of the data set.
ggplot(subset(rita.year_day_of_week, rita.year_day_of_week$Year >2000), aes(x=DayOfWeek, y=ArrDelay_mean)) +
geom_point(aes(colour=DayOfWeek)) +
facet_grid(.~Year, scales="free") +
ylab('Arrival Delay (min)') +
theme_bw() +
theme(axis.text.x=element_blank())
We recognise the same pattern as in the previous plot but also a substential increase of the delay. On the left the data points are clustered on the bottom part of the plot where on the right (more recently), the data points are stacked on the upper side of the drawing.
Next, let’s visualize the relationship between taxi-in and the origin. Taxi indicates the movement of an aircraft on the ground. Here we will analyse the taxiing immediately after the landing (taxi-in).
rita.taxi_in_origin <- filter(rita, !is.na(TaxiIn))
rita.taxi_in_origin <- tablefreq(rita[,c("TaxiIn","Origin")])
rita.taxi_in_origin <- head(arrange(rita.taxi_in_origin, -freq), n=100)
## Bar plot
ggplot(aes(x=TaxiIn, weight= freq, colour = Origin), data = rita.taxi_in_origin, position ="dodge") +
geom_bar() +
theme_bw()
For most of the flights, the taxiing time spent after landing is between 3 and 5 minutes. Let’s compute the relative frequencies for each group.
rita.taxi_in_origin <- rita.taxi_in_origin %>%
group_by(Origin) %>%
mutate( ngroup= sum(freq), wgroup= freq/ngroup)
ggplot(aes(x=TaxiIn, weight=wgroup, colour = Origin),data = rita.taxi_in_origin) +
geom_density() +
theme_bw()
From this last plot, we see that the time spend on taxiing is similar from airport to airport.
rita.dest_highest_delays <- rita %>%
group_by(Dest) %>%
summarise(
arr_delay = mean(ArrDelay, na.rm = TRUE),
n = n()) %>%
arrange(desc(arr_delay))
ggplot(rita.dest_highest_delays) +
geom_text(aes( x=arr_delay, y=n, label=Dest)) +
theme_bw()
This plot is a bit messy but show some interesting facts. Some destination airports (ORD, ATL, DFW) have very low average delay despit the number of fligh they accumulate but on the other hand, some airports (RDR) have very high average delay with very few flights.
rita.per_hour <- rita %>%
filter(Cancelled == 'No' & !is.na(rita$AirTime)) %>%
mutate(time = AirTime / 60) %>%
group_by(time) %>%
summarise(
arr_delay = mean(ArrDelay, na.rm = TRUE),
n = n()
)
qplot(time, arr_delay, data = rita.per_hour, size = n) +
scale_size_area() +
theme_bw()
ggplot(filter(rita.per_hour, n > 30), aes(time, arr_delay)) +
geom_point() +
stat_smooth() +
theme_bw()
As the flight time increase the arrival time is increasing. We see a positive increase of this trend from one hour of flight to 2 hours, then it remain constant from 2 to 4 than the data becomes more dispersed.
The next last plot also show the density of flights. As the time increase, we have less flights.
qplot(time, arr_delay, data = filter(rita.per_hour, n > 30), size = n) +
scale_size_area() +
theme_bw()
Goal : fit a linear model to each day, predicting delay from time of day
# check whether rows contain NAs for the DepTime
row.has.na <- apply(rita, 1, function(DepTime){any(is.na(DepTime))})
sum(row.has.na)
## [1] 582795
rita.rita_wNA <- rita[!row.has.na,]
rita.usual <- rita.rita_wNA %>%
mutate(time = as.numeric(substring(rita.rita_wNA$DepTime, 1, 2)) + as.numeric(substring(rita.rita_wNA$DepTime, 3, 4)) / 60) %>%
filter(as.numeric(substring(rita.rita_wNA$DepTime, 1, 2)) >= 5, as.numeric(substring(rita.rita_wNA$DepTime, 1, 2)) <= 18)
rita.usual <- mutate(rita.usual, date = as.Date(paste(rita.usual$Year, rita.usual$Month, rita.usual$DayofMonth, sep=""), "%Y%m%d"))
p <- ggplot(rita.usual, aes(x=time, y=DepDelay)) +
geom_point(alpha = 1/10, position = position_jitter(h = 0)) +
scale_y_continuous(limits = c(0, 200),
breaks = c(0, 50, 100, 150, 200)) +
theme_bw()
print(p)
## Warning: Removed 86582 rows containing missing values (geom_point).
lmfit = lm(rita.usual$DepDelay ~ rita.usual$time)
lmfit
##
## Call:
## lm(formula = rita.usual$DepDelay ~ rita.usual$time)
##
## Coefficients:
## (Intercept) rita.usual$time
## -9.101 1.349
summary(lmfit)
##
## Call:
## lm(formula = rita.usual$DepDelay ~ rita.usual$time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -166.70 -12.06 -6.21 0.22 1268.46
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.1014 0.2255 -40.36 <2e-16 ***
## rita.usual$time 1.3492 0.0174 77.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.24 on 180291 degrees of freedom
## Multiple R-squared: 0.03226, Adjusted R-squared: 0.03225
## F-statistic: 6010 on 1 and 180291 DF, p-value: < 2.2e-16
p + geom_abline(lmfit$intercept, lmfit$slope, colour = "red")
## Warning: Removed 86582 rows containing missing values (geom_point).
We clearly see from the R^2 and the fitting that our model is really poor and we cannot use as a preditice tool for the delays… There are probably many other variables that must be included to imrpove this model. Also the scatter plot show a lot of noise.
Let’s visualize the delay over the months and see which months are better
In a table format, here is the delay per carrier for each months
rita.monthly_delay.wide <- rita %>%
mutate(month=Month, carrier=UniqueCarrier) %>%
group_by(month, carrier) %>%
summarize(delay=sum(ArrDelay, na.rm=T)) %>%
dcast(month ~ carrier, value.var = "delay")
head(rita.monthly_delay.wide, 5)
## month 9E AA AQ AS B6 CO DH DL EA EV F9 FL
## 1 01 1950 63280 -49 13236 2965 33521 4227 79173 8731 7504 897 5552
## 2 02 3352 46207 301 12515 3686 34173 4973 67292 4615 9582 1278 7408
## 3 03 848 54400 202 10384 3651 37288 2489 68618 2407 5081 736 6009
## 4 04 168 43703 157 7326 4234 24763 -187 54101 2336 5069 143 2522
## 5 05 84 46803 428 9244 1064 27205 3822 51005 1754 6109 1254 1926
## HA HP ML (1) MQ NW OH OO PA (1) PI PS TW TZ
## 1 240 18315 1013 20825 34947 7715 18653 1503 6443 482 23086 793
## 2 -40 15332 158 18544 30701 6112 13315 188 5165 -66 14391 792
## 3 416 15296 209 17864 30555 3601 10085 509 7228 -224 13495 629
## 4 -163 11800 21 13263 24760 2138 3820 773 4346 24 10393 -17
## 5 670 11683 -33 18494 20182 4045 3318 1283 6272 NA 10969 822
## UA US WN XE YV
## 1 72761 49985 48624 10512 6383
## 2 54607 47417 45734 7755 7314
## 3 55744 45619 47241 12410 4962
## 4 43070 34658 35066 9085 4344
## 5 55233 33505 35974 9072 4204
If we now try to visualize these data, we do it like so
rita.monthly_delay <- rita %>%
mutate(month=Month, carrier=UniqueCarrier) %>%
group_by(month, carrier) %>%
summarize(delay=mean(ArrDelay, na.rm=T))
rita.monthly_delay$month <- factor(rita.monthly_delay$month,
levels = c('01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'),
labels = c("January","February","March","April","May","June","July","Augus","September","October","November","December"))
ggplot(NULL, aes(x=month, y=delay, fill=carrier, alpha=.2)) +
geom_bar(data=rita.monthly_delay %>% filter(delay > 0),
stat='identity',
position = "identity",
colour="black") +
geom_bar(data=rita.monthly_delay %>% filter(delay < 0),
stat='identity',
position = "identity",
colour="black") +
labs(x = "Month",
y = "Delay (hours)") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = .5, vjust=0.5))
This plot is not the best to display a trend because the coloring stacking is hidding things however we see that some airport are better in delay at the begining of the year and other at the end of the year. So we start to see a seasonal effect in the delays. The next plot is just an improvement of this visualization to see more relevant points.
ggplot(NULL, aes(x=month, y=delay, color=carrier)) +
geom_point(data=rita.monthly_delay %>% filter(delay > 0),
stat='identity',
position = "identity") +
geom_point(data=rita.monthly_delay %>% filter(delay < 0),
stat='identity',
position = "identity") +
labs(x = "Month",
y = "Delay (hours)") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = .5, vjust=0.5))
location <- airports %>%
select(Dest = iata, name = airport, lat, long)
rita.rita_full_dest <- rita %>%
group_by(Dest) %>%
filter(!is.na(ArrDelay)) %>%
summarise(
arr_delay = mean(ArrDelay),
n = n()
) %>%
arrange(desc(arr_delay)) %>%
left_join(location)
rita.rita_full_dest[["delay_type"]] = ordered(cut(rita.rita_full_dest$arr_delay, c(0, 5, 25, 50, Inf)), labels = c("On-Time", "Delayed", "Intermediate-Delayed", "Much delayed"))
# load the state data
data(state)
states <- data.frame(state.abb, state.name, state.area, state.center,
state.division, state.region, state.x77)
# Remove the data points out of the USA
maxLong <- max(states$x)
minLong <- min(states$x)
maxLat <- max(states$y)
minLat <- min(states$y)
rita.rita_full_dest <- subset(rita.rita_full_dest, long <= maxLong & long >= minLong & lat <= maxLat & lat >= minLat)
map <- map_data("state")
ggplot(data = map, aes(x = long, y = lat, group = group)) +
# draw outlines of the state
geom_polygon() +
geom_path(colour = 'gray', linestyle = 2) +
coord_map() +
geom_text(data = states, aes(x = x,
y = y,
label = state.abb,
group = NULL),
size = 4,
colour="white") +
geom_point(data = rita.rita_full_dest,
aes(x = long, y = lat, size = arr_delay, colour = delay_type, group = NULL)) +
geom_text(aes(x = long, y = lat, label=Dest, group = NULL),
data = subset(rita.rita_full_dest, rita.rita_full_dest$arr_delay >= 25),
hjust=-.2, fontface='bold', col = 'red') +
scale_colour_discrete(name="Delay Type") +
scale_size_continuous(range = c(2,10), name = "Delays") +
ggtitle("Airports with the highest delay (in minutes)") +
theme_bw() +
theme(legend.position="bottom")
This plot show the airports delays on a map. For that we need to connect to the airports dataset. The location with a bigger point show where the delays are the highest.
From the map we clearly see that we have more destination with delays (green dots) versus on-time (red dots).
In continuity with the previous plot, we will show the airports with the highest delay from those having at minimum 15 minutes delays.
rita.rita_full_dest[["delay_25"]] = ifelse(rita.rita_full_dest$arr_delay >= 25, "delay_25", "notDelay_25")
ggplot(data = subset(rita.rita_full_dest, arr_delay > 15)) +
geom_bar(aes(x=name, y=arr_delay, fill=delay_25), stat='identity') +
scale_y_continuous(limits = c(0, 100)) +
geom_text(data=subset(rita.rita_full_dest, arr_delay >= 25), aes(x=name, y=arr_delay, label=Dest), hjust=-.1, fontface=3) +
coord_flip() +
xlab(NULL) +
ylab("Delays in minutes") +
theme_bw() +
scale_fill_manual(values = c("delay_25" = "darkblue", "notDelay_25" = "lightgrey")) +
theme(legend.position="none")
For the third plot we will try to visualize the impact of the September 11, 2001 attacks on the volume of passengers for the busiest airport of USA in that year. For that we will create a new dataset containing information about 5 airports between July and December 2001. For the selection of the airports we will take only the busiest airports by passenger in 2001 (wikipedia link). These airports are :
Instead of working with an SQLite db, this time we will download the file through R and manipulate the data from R studio directly.
file.name <- paste(2001, "csv.bz2", sep = ".")
if (!file.exists(file.name)) {
url.text <- paste("http://stat-computing.org/dataexpo/2009/", 2001, ".csv.bz2",
sep = "")
cat("Downloading missing data file ", file.name, "nn", sep = "")
download.file(url.text, file.name)
}
Next, we extract the file.
bzip2 -d 2001.csv.bz2
We load the file in R studio and make some basic manipulations.
rita_2001 <- read.csv("2001.csv")
# we take the destination airports of interest
rita_2001 <- subset(rita_2001, Dest %in% c('ATL', 'ORD', 'LAX', 'DFW', 'DEN'))
# we take the data from July to December
rita_2001 <- subset(rita_2001, Month %in% c(7, 8, 9, 10, 11, 12))
write.csv(rita_2001, "rita_2001_busy.csv", row.names=FALSE, na="")
Finaly we visualise the data.
rita_2001 <- read.csv("rita_2001_busy.csv")
rita_2001$Month <- sprintf("%02d", rita_2001$Month)
rita_2001$DayofMonth <- sprintf("%02d", rita_2001$DayofMonth)
# full date of the departure
rita_2001 <- mutate(rita_2001, date = as.Date(paste(rita_2001$Year, rita_2001$Month, rita_2001$DayofMonth, sep=""), "%Y%m%d"))
rita_2001.rita_2001_per_day <- rita_2001 %>%
group_by(date, Dest) %>%
filter(!is.na(ArrDelay)) %>%
summarise(
arr_delay = mean(ArrDelay),
n = n()
) %>%
arrange(date, Dest) %>%
left_join(location)
ggplot(rita_2001.rita_2001_per_day) +
geom_line(aes(x=date, y=n, col=name), size = 1) +
geom_vline(aes(xintercept=as.numeric(as.Date("2001-09-11"))), linetype=1, size = 1, colour = "grey") +
geom_text(aes(x=as.Date("2001-09-15"), label="Sep `01 - W.T.C. Attack", y=950, hjust = 0), colour="grey", angle=0, text=element_text(size=11)) +
xlab("Month") +
ylab("Volume") +
ggtitle("Volume at major Airports, impact of the WTC attack") +
theme_bw() +
theme(
panel.grid.major.y = element_line(colour = "black", linetype = 3, size = .5),
panel.background = element_blank(),
axis.title.x = element_text(size=16),
axis.text.x = element_text(size=14, angle=45, hjust=1, vjust=1),
axis.title.y = element_text(size=16, angle = 90),
axis.text.y = element_text(size=14),
strip.background = element_rect(color="white", fill="white"),
strip.text = element_text(size=16),
legend.position="bottom",
legend.direction = "vertical"
)
This plot show the impact of the WTC attacks on the volume of passenger for the busiest airports of US. We clearly see the drop on the 9/11. The recovery of the volume has been almost immediate once the flight traffic has been re-opened after the event but not at the same level as before the attacks.
An other element to notice is the uneven effect visible for each airport. This is the effect of the weekend/seasonality which drop the number of flights.
I was interested in analyzing this particular set of data. First because it is an accessible piece of information which does not require any background domain knowledge in order to run an analysis. Second, the very large volume of data was a very good chalange to me. I spent a lot of time just loading, parsing and creating a usable dataset in R studio.
I struggled a bit at first with exactly what I expected to learn from this dataset. However, I did a lot of quick coding to reveal some interesting information about the data.
The analysis performed reveal some obvious and non obvious trend in the data but many times I was a bit confused about how to interpret the findings. There are several variables not included as an explanatory or response variable in the dataset but can affect the interpretation of relationships between variables (lurking variable). For example, the aircraft maintenance, the staffing, … All these features may impact the delay of a flight.
The various regression model I tried to implement was not really conclusive, especially when I try to fit a linear model to each day, predicting delay from time of day. The nature of the data did not reveal obvious correlation.
I was pleased with how easy it is to produce a simple map of the world in R. And plotting points and lines was easy and straight forward.
It really helped to just systematically go through each variable with some quick and dirty plots and see what the data showed me. There is still much to be learned from this dataset, and more questions to find and answer. For example :
This has really helped me flesh out my varied interests in data analytics. I look forward to analyzing more intriguing datasets in the future, and hope you found some of the insights into this dataset as interesting as I did.